Assignment 1 (Section 20)

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Do not write your name on the assignment.

  3. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

  4. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  5. The assignment is worth 100 points, and is due on Wednesday, 24th January 2024 at 11:59 pm.

  6. There is a bonus question worth 15 points.

  7. Five points are for properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.
    • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
  8. The maximum possible score on the assignment is 100 + 15 (bonus question) + 5 (proper formatting) = 120 out of 100. Some parts of the bonus question carry no partial credit.

1) Case Studies: Regression vs Classification and Prediction vs Inference (16 points)

1a)

For each case below, explain (1) whether it is a classification or a regression problem and (2) whether the main purpose is prediction or inference. You need to justify your answers for credit.

1b)

You work for a company that is interested in conducting a marketing campaign. The goal of your project is to identify individuals who are likely to respond positively to a marketing campaign, based on observations of demographic variables (such as age, gender, income etc.) measured on each individual. (2+2 points)

1c)

For the same company, now you are working on a different project. This one is focused on understanding the impact of advertisements in different media types on the company sales. For example, you are interested in the following question: ‘How large of an increase in sales is associated with a given increase in radio and TV advertising?’ (2+2 points)

1d)

A company sells furniture and is interested in finding the association between demographic characteristics of customers (such as age, gender, income etc.) and whether they would purchase a particular company product. (2+2 points)

1e)

We are interested in forecasting the % change in the USD/Euro exchange rate using the weekly changes in the stock markets of a number of countries. We collect weekly data for all of 2023. For each week, we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market. (2+2 points)

2) Examples for Different Regression Metrics: RMSE vs MAE (8 points)

2a)

Describe a regression problem where it is more appropriate to evaluate the model performance using the root mean squared error (RMSE) metric than the mean absolute error (MAE) metric. You need to justify your answer for credit. (4 points)

Note: You are not allowed to use the datasets and examples covered in the lectures.

2b)

Describe a regression problem where it is more appropriate to evaluate the model performance using the mean absolute error (MAE) metric than the root mean squared error (RMSE) metric. You need to justify your answer for credit. (4 points)

Note: You are not allowed to use the datasets and examples covered in the lectures.
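For reference, the two metrics in this question can be computed as in the sketch below. It uses only the Python standard library, and the numbers are made up for illustration (they do not come from any course dataset). The helper names rmse and mae are hypothetical, not functions from a particular library.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: average of the absolute residuals."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up values, for illustration only.
actual = [10.0, 12.0, 14.0, 16.0]
small_errors = [11.0, 11.0, 15.0, 15.0]     # every prediction is off by 1
one_large_error = [10.0, 12.0, 14.0, 20.0]  # a single prediction is off by 4

# Both sets of predictions have the same total absolute error,
# but the metrics do not treat them the same way.
print("small errors:   ", rmse(actual, small_errors), mae(actual, small_errors))
print("one large error:", rmse(actual, one_large_error), mae(actual, one_large_error))
```

Running the sketch and comparing the two printed lines shows how each metric responds when the same total error is spread evenly versus concentrated in one observation.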

3) Modeling the Petrol Consumption in U.S. States (61 points)

Read petrol_consumption_train.csv. Assume that each observation is a U.S. state. For each observation, the data has the following variables as its five columns:

Petrol_tax: Petrol tax (cents per gallon)

Per_capita_income: Average income (dollars)

Paved_highways: Paved Highways (miles)

Prop_license: Proportion of population with driver’s licenses

Petrol_consumption: Consumption of petrol (millions of gallons)

3a)

Create a pairwise plot of all the variables in the dataset. (1 point) Print the correlation matrix of all the variables as well. (1 point) Which variable has the highest linear correlation with Petrol_consumption? (2 points)

Note: Remember that a pairwise plot is a visualization tool that you can find in the seaborn library.

3b)

Fit a simple linear regression model to predict Petrol_consumption using the column you found in part a as the only predictor. Print the model summary. (4 points)

3c)

What is the increase in petrol consumption for an increase of 0.05 in the predictor? (4 points)

3d)

Does petrol consumption have a statistically significant relationship with the predictor? You need to justify your answer for credit. (4 points)

3e)

How much of the variation in petrol consumption can be explained by its linear relationship with the predictor? (3 points)

3f)

Predict the petrol consumption for a state in which 50% of the population has a driver’s license. (3 points) What are the confidence interval (3 points) and the prediction interval (3 points) for your prediction? Which interval is wider? (1 point) Why? (2 points)

3g)

Predict the petrol consumption for a state in which 10% of the population has a driver’s license. (3 points) Are you getting a reasonable outcome? (1 point) Why or why not? (2 points)

3h)

What is the residual standard error of the model? (3 points)

3i)

Using the trained model, predict the petrol consumption of the observations in petrol_consumption_test.csv (2 points) and find the RMSE. (2 points) What is the unit of this RMSE value? (1 point)

3j)

Based on the answers to part h and part i, do you think the model is overfitting? You need to justify your answer for credit. (4 points)

3k)

Make a scatterplot of Petrol_consumption vs. the predictor using petrol_consumption_test.csv. (1 point) Over the scatterplot, plot the regression line (2 points), the prediction interval (2 points), and the confidence interval. (2 points)

Make sure that the regression line, the prediction interval lines, and the confidence interval lines have different colors. (1 point) Display a legend that correctly labels the lines as well. (1 point) Note that you need two lines of the same color to plot an interval.

3l)

Find the correlation between Petrol_consumption and the rest of the variables in petrol_consumption_train.csv. Which column would have the lowest R-squared value when used as the predictor for a Simple Linear Regression model to predict Petrol_consumption? Note that you can directly answer this question from the correlation values and do not need to develop any more linear regression models. (3 points)
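The note above relies on a property of simple linear regression with an intercept: the R-squared of the fitted line equals the square of the Pearson correlation between the predictor and the response. The sketch below checks this numerically on made-up data using only the standard library. The function names pearson_r and slr_r_squared are hypothetical helpers, and the data values are not from the petrol dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def slr_r_squared(x, y):
    """R-squared of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx               # slope estimate
    b0 = mean_y - b1 * mean_x    # intercept estimate
    rss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))  # residual sum of squares
    tss = sum((b - mean_y) ** 2 for b in y)                    # total sum of squares
    return 1 - rss / tss


# Made-up values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
print("r squared:", r ** 2)
print("R-squared:", slr_r_squared(x, y))
```

The two printed values agree up to floating-point rounding, which is why ranking the predictors by the magnitude of their correlation with the response is enough to rank them by R-squared.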

4) Reproducing the Results with Scikit-Learn (15 points)

4a)

Using the same datasets, same response and the same predictor as Question 3, reproduce the following outputs with scikit-learn:

  • Model RMSE for test data (3 points)
  • R-squared value of the model (3 points)
  • Residual standard error of the model (3 points)

Note that you are only allowed to use scikit-learn, pandas, and numpy tools for this question. Solutions that use any other library will not receive credit.

4b)

Which of the model outputs from Question 3 cannot be reproduced using scikit-learn? Give two answers. (2+2 points) What does this tell you about scikit-learn? (2 points)

5) Bonus Question (15 points)

Please note that the bonus question requires you to look more into the usage of the tools we covered in class and it will be necessary to do your own research. We strongly suggest attempting it after you are done with the rest of the assignment.

5a)

Fit a simple linear regression model to predict Petrol_consumption based on the predictor in Question 3, but without an intercept term. (5 points - no partial credit)

Without an intercept means that the equation becomes \(Y = \beta_1X\). The intercept term, \(\beta_0\), becomes 0.
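To make the equation above concrete, here is a minimal sketch of the least-squares estimate for a line forced through the origin, which has the closed form \(\hat{\beta}_1 = \sum x_i y_i / \sum x_i^2\). It uses plain Python on made-up data and is not the fitting workflow expected for this question, which should use the tools covered in class. The function name fit_through_origin is a hypothetical helper.

```python
def fit_through_origin(x, y):
    """Least-squares slope for the model Y = beta_1 * X (no intercept term)."""
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    return sum_xy / sum_xx


# Made-up values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

beta_1 = fit_through_origin(x, y)
predictions = [beta_1 * a for a in x]  # the fitted line passes through (0, 0)

print("beta_1:", beta_1)
print("predictions:", predictions)
```

Because there is no \(\beta_0\), the prediction at \(X = 0\) is always 0, regardless of the data.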

Note: You must answer this part correctly to qualify for the bonus points in the following parts.

5b)

Predict the petrol consumption for the observations in petrol_consumption_test.csv using the model without an intercept and find the RMSE. (1+2 points) Then, print the summary and find the R-squared. (2 points)

5c)

The RMSE for the models with and without the intercept are similar, which indicates that both models are almost equally good. However, the R-squared for the model without intercept is much higher than the R-squared for the model with the intercept. Why? Justify your answer. (5 points - no partial credit)